Week 2: Geocentric models

Categories and curves

Workspace setup:

As we develop more useful models, we’ll begin to practice the art of generating models with multiple estimands. An estimand is a quantity we want to estimate from the data. Our models may not themselves produce the answer to our central question, so we need to know how to calculate these values from the posterior distributions.

Categories

Forget dummy codes. From here on out, we will incorporate categorical causes into our models by using index variables. An index variable contains integers that correspond to different categories. The numbers have no inherent meaning – rather, they stand as placeholders or shorthand for categories.

data("Howell1")
d <- Howell1
d$sex <- ifelse(d$male == 1, 2, 1) # 1 = female, 2 = male
head(d[, c("male", "sex")])
  male sex
1    1   2
2    0   1
3    0   1
4    1   2
5    0   1
6    1   2

Mathematical model

Let’s write a mathematical model to express height in terms of sex.

\[\begin{align*} h_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{SEX[i]} \\ \alpha_j &\sim \text{Normal}(178, 20)\text{ for }j = 1..2 \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]

flist <- alist(
  height ~ dnorm( mu , sigma) ,
  mu <- a[sex] ,
  a[sex] ~ dnorm( 178, 20 ) ,
  sigma ~ dunif(0, 50)
)

Fitting the model using quap()

m1 <- quap(
  flist, data=d
)

precis(m1, depth=2)
           mean       sd      5.5%     94.5%
a[1]  134.91023 1.606922 132.34206 137.47840
a[2]  142.57812 1.697456 139.86525 145.29098
sigma  27.30972 0.828024  25.98638  28.63306

Here, we are given the estimates of the parameters specified in our model: the average height of women (a[1]) and the average height of men (a[2]). But our question is whether these average heights are different. How do we get that?

post <- extract.samples( m1 )
str(post)
List of 2
 $ sigma: num [1:10000] 27.2 26.8 29.9 25.6 27.2 ...
 $ a    : num [1:10000, 1:2] 136 133 133 134 134 ...
 - attr(*, "source")= chr "quap posterior: 10000 samples from m1"
head(post$a)
         [,1]     [,2]
[1,] 136.2863 141.6177
[2,] 132.7791 142.0095
[3,] 133.3070 142.8278
[4,] 133.6014 144.0224
[5,] 134.4843 142.0545
[6,] 134.7175 145.0595
post$diff_fm <- post$a[,1] - post$a[,2]
precis(post, depth=2 )
              mean       sd      5.5%      94.5%      histogram
sigma    27.312734 0.838853  25.96678  28.665512  ▁▁▁▁▃▇▇▇▃▂▁▁▁
a[1]    134.880703 1.619106 132.28021 137.439502 ▁▁▁▁▂▅▇▇▅▂▁▁▁▁
a[2]    142.576643 1.670353 139.89733 145.252898  ▁▁▁▂▃▇▇▇▃▂▁▁▁
diff_fm  -7.695941 2.314384 -11.41048  -3.981524     ▁▁▁▂▇▇▃▁▁▁

We can create two plots. One is the posterior distributions of average female and male heights and one is the average difference.

p1 <- post %>% as.data.frame() %>% 
  pivot_longer(starts_with("a")) %>% 
  mutate(sex = ifelse(name == "a.1", "female", "male")) %>% 
  ggplot(aes(x=value, color = sex)) +
  geom_density(linewidth = 2) +
  labs(x = "height(cm)") 

p2 <- post %>% as.data.frame() %>% 
  ggplot(aes(x=diff_fm)) +
  geom_density(linewidth = 2) +
  labs(x = "difference in height(cm)") 

( p1 | p2)

A note that the distributions of the mean heights is not the same as the distribution of heights period. For that, we need the posterior predictive distributions.

pred_f  <- rnorm(1e4, mean = post$a[,1], sd = post$sigma )
pred_m  <- rnorm(1e4, mean = post$a[,2], sd = post$sigma )

pred_post = data.frame(pred_f, pred_m) %>%
  mutate(diff = pred_f-pred_m)

# plot distributions
p1 <- pred_post %>% pivot_longer(starts_with("pred")) %>% 
  mutate(sex = ifelse(name == "pred_f", "female", "male")) %>% 
  ggplot(aes(x = value, color = sex)) +
  geom_density(linewidth = 2) +
  labs(x = "height (cm)")

# plot difference
# Compute density first
density_data <- density(pred_post$diff)

# Convert to a tibble for plotting
density_df <- tibble(
  x = density_data$x,
  y = density_data$y,
  fill_group = ifelse(x < 0, "male", "female")  # Define fill condition
)

# Plot with area fill
p2 <- ggplot(density_df, aes(x = x, y = y, fill = fill_group)) +
  geom_area() +  # Adjust transparency if needed
  geom_line(linewidth = 1.2, color = "black") +  # Keep one continuous curve
  labs(x = "Difference in height (F-M)", y = "density") +
  guides(fill = "none")

(p1 | p2)

Exercise

In the rethinking package, the dataset milk contains information about the composition of milk across primate species, as well as some other facts about those species. The taxonomic membership of each species is included in the variable clade; there are four categories.

  1. Create variable in the dataset to assign an index value to each of the 4 categories.
  2. Standardize the milk energy variable (kcal.per.g). 1
  3. Write a mathematical model to express the average milk energy (in standardized kilocalories) in each clade.
data("milk")
str(milk)
'data.frame':   29 obs. of  8 variables:
 $ clade         : Factor w/ 4 levels "Ape","New World Monkey",..: 4 4 4 4 4 2 2 2 2 2 ...
 $ species       : Factor w/ 29 levels "A palliata","Alouatta seniculus",..: 11 8 9 10 16 2 1 6 28 27 ...
 $ kcal.per.g    : num  0.49 0.51 0.46 0.48 0.6 0.47 0.56 0.89 0.91 0.92 ...
 $ perc.fat      : num  16.6 19.3 14.1 14.9 27.3 ...
 $ perc.protein  : num  15.4 16.9 16.9 13.2 19.5 ...
 $ perc.lactose  : num  68 63.8 69 71.9 53.2 ...
 $ mass          : num  1.95 2.09 2.51 1.62 2.19 5.25 5.37 2.51 0.71 0.68 ...
 $ neocortex.perc: num  55.2 NA NA NA NA ...
milk$clade_id <- as.integer(milk$clade)
milk$K <- standardize(milk$kcal.per.g)

\[\begin{align*} K_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{\text{CLAUDE}[i]} \\ \alpha_i &\sim \text{Normal}(0, 0.5) \text{ for }j=1..4 \\ \sigma &\sim \text{Exponential}(1) \\ \end{align*}\]

Exercise: Now fit your model using quap(). It’s ok if your mathematical model is a bit different from mine.

flist <- alist(
  K ~ dnorm( mu , sigma ) ,
  mu <- a[clade_id] , 
  a[clade_id] ~ dnorm( 0 , 0.5 ) , 
  sigma ~ dexp( 1 )
)

m2 <- quap(
  flist, data = milk
)

precis( m2, depth=2 )
            mean        sd        5.5%      94.5%
a[1]  -0.4843495 0.2176415 -0.83218259 -0.1365164
a[2]   0.3662533 0.2170592  0.01935085  0.7131558
a[3]   0.6752237 0.2575343  0.26363421  1.0868133
a[4]  -0.5858096 0.2745093 -1.02452851 -0.1470908
sigma  0.7196465 0.0965338  0.56536684  0.8739262

Plotting with rethinking

labels <- paste( "a[" , 1:4, "]:", levels(milk$clade),  sep="" )
plot(
  precis(m2, depth=2, pars = "a"),
  labels=labels, 
  xlab="expected kcal (std)"
)

Exercise

Plot the following distributions:

  • Posterior distribution of average milk energy by clade.
  • Posterior distribution of predicted milk energy values by clade.
post <- extract.samples( m2 )
names(labels) = paste("a.", 1:4, sep = "")
post %>% as.data.frame() %>% 
  pivot_longer(starts_with("a")) %>% 
  mutate(name = recode(name, !!!labels)) %>% 
  ggplot(aes(x = value, color = name)) +
  geom_density(linewidth = 2)
post <- extract.samples( m2 )
a.1 = rnorm(1e4, post$a[,1], post$sigma)
a.2 = rnorm(1e4, post$a[,2], post$sigma)
a.3 = rnorm(1e4, post$a[,3], post$sigma)
a.4 = rnorm(1e4, post$a[,4], post$sigma)
data.frame(a.1, a.2, a.3, a.4) %>% 
  pivot_longer(everything()) %>% 
  mutate(name = recode(name, !!!labels)) %>% 
  ggplot(aes(x = value, color = name)) +
  geom_density(linewidth = 2)